Mistle: bringing spectral library predictions to metaproteomics with an efficient search index

Abstract Motivation Deep learning has moved to the forefront of tandem mass spectrometry-driven proteomics and authentic prediction for peptide fragmentation is more feasible than ever. Still, at this point spectral prediction is mainly used to validate database search results or for confined search spaces. Fully predicted spectral libraries have not yet been efficiently adapted to large search space problems that often occur in metaproteomics or proteogenomics. Results In this study, we showcase a workflow that uses Prosit for spectral library predictions on two common metaproteomes and implement an indexing and search algorithm, Mistle, to efficiently identify experimental mass spectra within the library. Hence, the workflow emulates a classic protein sequence database search with protein digestion but builds a searchable index from spectral predictions as an in-between step. We compare Mistle to popular search engines, both on a spectral and database search level, and provide evidence that this approach is more accurate than a database search using MSFragger. Mistle outperforms other spectral library search engines in terms of run time and proves to be extremely memory efficient with a 4- to 22-fold decrease in RAM usage. This makes Mistle universally applicable to large search spaces, e.g. covering comprehensive sequence databases of diverse microbiomes. Availability and implementation Mistle is freely available on GitHub at https://github.com/BAMeScience/Mistle.

Visualization of AVX2 fused multiply-add operation performed vertically on 8 intensity values (32-bit floats). The process associated with data loading into and retrieving from 256-bit destinations is displayed for a single fragment bin.

Methods
In this section we provide additional information to some of method deployed in the main article, and elaborate the data processing steps. This serves as ancillary information, and is only fully comprehensible in combination with the main text.

Continuous index construction
As explained in the main article, the fragment index needs to be constructed continuously during the reading process of the spectral library. This goes as follows: Each library spectrum is read one after the other, precursor information is added to the precursor index and all their peaks are streamed as fragment triplets to the corresponding index partition in binary format on disk. Missing (zero-intensity) theoretical fragment ions are not explicitly calculated from the peptide sequence and remain untracked unless they are provided within the spectral library.
The process is parallelized by having a single thread dedicated to reading while the rest are occupied by formatting and writing tasks. At this point, fragments are unsorted and not binned, but assigned to their partition. After the library has been read entirely, the precursor index is sorted by precursor charge and m/z, and the ID-to-rank mapping is established by a linear scan over the precursor list. Then, every partition is loaded again, one at a time, and fragments are sorted based on parent ranks, accessible via the ID. Afterwards, the partition is saved to disk in binary format. Binning only happens during the search, based on the user-specific fragment tolerance.

Search Loop
Let Q be a query spectrum with precursor m/z mz Q and a set of peaks P Q . We identify the best PSMs by computing the dot product to the reference library spec-tra, carrying out the following steps: (1) First, the range of library candidate spectra that lie within the precursor mass tolerance t is computed using the precursor index. A binary search on the sorted precursor entries swiftly finds the lower bound L and upper bound U , which is the rank of the first and last library spectrum with a precursor m/z ∈ [mz Q − t, mz Q + t].
(2) Next, a scoring vector (scores) of length U − L is allocated and initialized with all zeros. This way, the score of each candidate spectrum can be accessed from its rank R by subtracting L.
(3) Every peak p = (mz p , I p ) ∈ P Q is searched in the fragment index by first determining the fragment bin matching its ion mass mz p . Recall that inside each bin the fragments are stored in order of their parent ranks. A binary search then identifies the first fragment f = (mz f , I f , ID parent (f)) with a parent rank R greater than or equal to L (rank of the first candidate). R is derived from ID parent (f) via the ID-to-rank mapping.
(4) The fragment intensity I f is multiplied with the query peak intensity I p and the product is added to the score of the fragment's parent: The process is repeated with the next fragment in the bin until a fragment with parent rank greater than U is reached. This marks the end of candidates for that fragment bin.
(5) At the end, the scores equal the dot products between search spectrum Q and every candidate library spectrum covered by the partition. We rescore the best-scoring library spectra with an elaborate scoring function (see main article). Then, the X highest scoring library spectra are selected, and the corresponding peptides are returned as PSMs to Q. X, the number of output PSMs per query spectrum (X>0), is a parameter defined by the user.
The search function is parallelized matching each query spectrum on a separate thread. After all scheduled queries are performed, the resulting PSMs from all the partitions are concatenated and sorted by query ID. Matches assigned to the same experimental spectrum cluster together, and again only the top X ranked matches are retained, if multiple partitions produced hits for the same query.

SIMD intrinsics
Single instruction multiple data (SIMD) is a type of parallel computation, which simultaneously performs an operation on packed groups of data, instead of on every single data point individually (Amiri and Shahbahrami, 2020). Since their introduction to general-purpose processors, first by Intel's MultiMedia eXtensions (MMX) in 1996, it has been widely established as significant means to speed up performance (Amiri and Shahbahrami, 2020;Hassaballah et al., 2008;Zhou and Ross, 2002). The search algorithm requires many successive multiplication and addition operations when updating parent scores with peak intensity products. Consequently, SIMD extensions are an eligible option to improve our run time. We use the Advanced Vector Extensions AVX2 and AVX512 architectures, which support the fused multiply-add arithmetic operation (for 256-bits in C++: mm256 fmadd ps) for floating-point vectors, thereby updating multiple parent scores in parallel. A schematic version of the workflow for a 256-bit register is depicted in Supplementary Figure 2.
Explicitly, this is done by broadcasting a single 32-bit float, the query peak intensity value, into the 256 or 512-bit register (for 256-bits in C++: mm256 set1 ps) and loading respectively 8 or 16 fragment intensities from the fragment-ion bin (using mm256 loadu ps). Then, the corresponding score values are inserted into another register. Finally, the multiply-add operation (using mm256 fmadd ps) is performed vertically on all 8 or 16 values at the same time: multiplying the query intensity with the individual fragment intensities and adding the result to the corresponding scores, which are extracted afterwards. The AVX512 instruction set provides a gather instruction to quickly access the values from the scoring vector and a scatter instruction to put them back in place after the computation. Note that the fragment bins need to be adjusted in their format to enable swift loading of intensity values.

9MM
In 2013, Tanca et al. investigated the effect of sequence databases used to query shotgun proteomic results for diverse microbial communities. They evaluate their findings on a lab-assembled mock community of nine bacterial and eukaryotic species: Escherichia coli, Pasteurella multocida, Brevibacillus laterosporus, Lac-tobacillus acidophilus, Lactobacillus casei, Enterococcus faecalis, Pediococcus pentosaceus, Rhodotorula glutinis, and . The 9MM dataset has since been used to evaluate metaproteomic pipelines. We utilize the sequence database (9MM DB.fasta) and 4 search files (9MM FASP.raw, 9MM PPID.raw, 9MM Run 1.raw, 9MM Run 2.raw) provided in the original and the follow-up study, which can be found at http://www.peptideatlas.org/PASS/P and http://www.peptideatlas.org/PASS/PASS00355 (Tanca et al., 2013(Tanca et al., , 2014. Raw files were convered using ms-convert from the ProteoWizard software (Chambers et al., 2012) with peak picking retaining the top 150 most intense peaks.

SIHUMIx
The extended simplified human microbiota (SIHUMIx), established by Krause et al. (2020), is a model community of eight species from the human intestine that account for most of the typical metabolic activities in the human gut. The model allows consistent and reproducible in-vitro cultativation, making it ideal to investigate the effect of treatments to the microbiome (Schäpe et al., 2020). SIHUMIx consists of the species: Anaerostipes caccae, Bifidobacterium longum, Bacteroides thetaiotaomicron, Blautia producta, Clostridium butyricum, Clostridium ramosum, Escherichia coli and Lactobacillus plantarum.
As reference sequence database we use the one provided by the CAMPI challenge (Van Den Bossche et al., 2021), which already includes decoy and contaminant sequences (Pride Project ID: PXD023217). Searches are performed on two large search files (S05, S06) from the CAMPI study, which yielded the most identified PSMs without fractionation.